Fix pods remaining pending after local volume release until manual intervention #505

Draft · wants to merge 2 commits into base: master

@Copilot Copilot AI commented Jul 27, 2025

This PR fixes an issue where pods would remain pending after a local volume was released until the pending pod was manually deleted and recreated.

Problem

When using local volumes with the local-static-provisioner:

  1. Pod A uses a local volume (PV becomes Bound)
  2. Pod B requests the same volume but remains Pending, because its claim cannot bind to a PV that is already Bound
  3. Pod A is deleted (PV transitions to Released state)
  4. Pod B continues to remain Pending indefinitely with "FailedScheduling" events showing "context deadline exceeded from PreBind VolumeBinding plugin"
  5. Only manually deleting and recreating Pod B allows it to bind successfully

Root Cause

The discovery process in pkg/discovery/discovery.go was skipping creation of new PVs when any existing PV was found, regardless of the PV's state. This meant Released PVs (which are not bindable) would prevent new bindable PVs from being created until the async cleanup process completed, creating a timing gap where no bindable PV existed.
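The root cause can be illustrated with a minimal sketch. It does not reproduce the actual code in pkg/discovery/discovery.go; the `PV` type, the `cache` map, and `shouldCreateOld` are hypothetical stand-ins that only model the described behavior, where the mere existence of a PV suppresses creation regardless of its phase.

```go
package main

import "fmt"

// Phase is a simplified, hypothetical stand-in for the PersistentVolume phase.
type Phase string

const (
	Available Phase = "Available"
	Bound     Phase = "Bound"
	Released  Phase = "Released"
	Failed    Phase = "Failed"
)

// PV is a hypothetical, minimal stand-in for a PersistentVolume object.
type PV struct {
	Name  string
	Phase Phase
}

// shouldCreateOld models the behavior described in the root cause:
// any existing PV, whatever its phase, blocks creation of a new one.
func shouldCreateOld(cache map[string]PV, name string) bool {
	_, exists := cache[name]
	return !exists
}

func main() {
	cache := map[string]PV{
		"local-pv-1": {Name: "local-pv-1", Phase: Released},
	}
	// The Released PV is not bindable, yet it still blocks a replacement.
	fmt.Println(shouldCreateOld(cache, "local-pv-1")) // false
	// A volume with no PV at all is created as usual.
	fmt.Println(shouldCreateOld(cache, "local-pv-2")) // true
}
```

In this model the phase field is never consulted, which is exactly the gap the PR description points at.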

Solution

Modified the discovery logic to branch on the existing PV's phase and reclaim policy:

  • For Released/Failed PVs with Delete reclaim policy: Immediately delete the old PV and create a new Available PV
  • For Released/Failed PVs with Retain reclaim policy: Preserve existing behavior (skip creation)
  • For Available/Bound PVs: Preserve existing behavior (skip creation)

This eliminates the timing gap and ensures volumes become available for new pod binding immediately after the previous pod is deleted.
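The three cases listed above can be sketched as a small decision function. This is an illustration under stated assumptions, not the PR's diff: the `PV`, `Action`, and `decide` names are hypothetical, the "no existing PV" case is inferred from the root-cause description, and any reclaim policy other than Delete is treated like Retain because the description only mentions those two.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the Kubernetes API types.
type Phase string
type ReclaimPolicy string

const (
	Available Phase = "Available"
	Bound     Phase = "Bound"
	Released  Phase = "Released"
	Failed    Phase = "Failed"

	Delete ReclaimPolicy = "Delete"
	Retain ReclaimPolicy = "Retain"
)

// Action is what discovery would do for a discovered local volume.
type Action string

const (
	Create            Action = "create"
	Skip              Action = "skip"
	DeleteAndRecreate Action = "delete-and-recreate"
)

// PV is a hypothetical, minimal stand-in for a PersistentVolume object.
type PV struct {
	Phase         Phase
	ReclaimPolicy ReclaimPolicy
}

// decide maps the existing PV (nil if none) to an action, following the
// three bullets in the Solution section.
func decide(existing *PV) Action {
	if existing == nil {
		return Create // assumption: no PV yet, so discovery creates one
	}
	switch existing.Phase {
	case Released, Failed:
		if existing.ReclaimPolicy == Delete {
			return DeleteAndRecreate
		}
		return Skip // Retain: preserve existing behavior
	default:
		return Skip // Available or Bound: preserve existing behavior
	}
}

func main() {
	fmt.Println(decide(&PV{Phase: Released, ReclaimPolicy: Delete})) // delete-and-recreate
	fmt.Println(decide(&PV{Phase: Released, ReclaimPolicy: Retain})) // skip
	fmt.Println(decide(&PV{Phase: Bound, ReclaimPolicy: Delete}))    // skip
}
```

Keeping the decision in a pure function like this makes the Delete and Retain scenarios easy to cover with table-driven tests, independent of any API server.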

Changes

  • Modified discovery logic in pkg/discovery/discovery.go to check PV state and reclaim policy
  • Added tests covering both the Delete and Retain reclaim policy scenarios
  • All existing tests continue to pass

Impact

  • Eliminates the need for manual pod deletion/recreation after volume release
  • Improves user experience and reduces operational overhead
  • Maintains backward compatibility and existing semantics
  • Preserves intended behavior for different reclaim policies

Fixes #504.



@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 27, 2025

linux-foundation-easycla bot commented Jul 27, 2025

CLA status: Missing ID · Not Signed

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 27, 2025
@k8s-ci-robot
Contributor

Hi @Copilot. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jul 27, 2025
- Modified discovery logic to handle Released PVs with Delete reclaim policy
- When discovery finds a Released PV with Delete policy, it immediately deletes the old PV and creates a new Available PV
- This eliminates the delay between volume release and availability for new pod binding
- Added comprehensive tests to validate the fix for both Delete and Retain reclaim policies
- All existing tests continue to pass

The fix ensures volumes become available for new pods immediately after release, solving the issue where pods would remain pending until manual intervention.

Co-authored-by: andyzhangx <[email protected]>
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Copilot
Once this PR has been reviewed and has the lgtm label, please ask for approval from andyzhangx. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jul 27, 2025
@Copilot Copilot AI changed the title from "[WIP] Pods Remain Pending After Local Volume Release Until Pod Deletion/Recreation" to "Fix pods remaining pending after local volume release until manual intervention" Jul 27, 2025
@Copilot Copilot AI requested a review from andyzhangx July 27, 2025 08:37
Copilot finished work on behalf of andyzhangx July 27, 2025 08:37
@andyzhangx
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 27, 2025
@k8s-ci-robot
Contributor

@Copilot: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-sig-storage-local-static-provisioner-verify | 537709f | link | true | /test pull-sig-storage-local-static-provisioner-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Labels
cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pods Remain Pending After Local Volume Release Until Pod Deletion/Recreation
3 participants